Variational Mode Seeking

ABSTRACT

A mode-seeking clustering mechanism identifies clusters within a data set based on the location of individual data point according to modes in a kernel density estimate. For large-scale applications the clustering mechanism may utilize rough hierarchical kernel and data partitions in a computationally efficient manner. A variational approach to the clustering mechanism may take into account variational probabilities, which are restricted in certain ways according to hierarchical kernel and data partition trees, and the mechanism may store certain statistics within these trees in order to compute the variational probabilities in a computational efficient way. The clustering mechanism may use a two-step variational expectation and maximization algorithm and generalizations hereof, where the maximization step may be performed in different ways in order to accommodate different mode-seeking algorithms, such as the mean shift, mediod shift, and quick shift algorithms.

BACKGROUND

Clustering is a component of machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Clustering is sometimes known as automatic classification, botryology, topological analysis, cluster analysis, and other terms.

SUMMARY

A mode-seeking clustering mechanism identifies clusters within a data set based on the location of individual data points according to modes in a kernel density estimate. For large-scale applications the clustering mechanism may utilize rough hierarchical kernel and data partitions in a computationally efficient manner. A variational approach to the clustering mechanism may take into account variational probabilities, which may be restricted in certain ways according to hierarchical kernel and data partition trees, and the mechanism may store certain statistics within these trees in order to compute the variational probabilities in a computational efficient way. Some embodiments may employ a marked partition tree that may incorporate information from both a kernel tree and a data partition tree. The clustering mechanism may use a two-step variational expectation and maximization algorithm and generalizations hereof, where the maximization step may be performed in different ways in order to accommodate different mode-seeking algorithms, such as the mean shift, mediod shift, and quick shift algorithms.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings,

FIG. 1 is a diagram illustration of an embodiment showing a system with a clustering mechanism.

FIG. 2 is a diagram illustration of an example embodiment showing a fixed kernel partition, dynamic data partition, and block partition.

FIG. 3 is an illustration of an embodiment showing a first property definition.

FIG. 4 is an illustration of an embodiment showing a second property definition.

FIG. 5 is a diagram illustration of an example embodiment showing a marked partition tree.

FIG. 6 is an illustration of an embodiment showing a recursive block definition method.

FIG. 7 is an illustration of an example embodiment showing an algorithmic description of an E-step.

FIG. 8 is a flowchart illustration of an embodiment showing a method for mode seeking.

DETAILED DESCRIPTION

A variational mode seeking mechanism may move data points together into groups of data points that define clusters. Data points in different clusters may be considered different and the clustering can therefore be used to reason about the data in different ways for many different applications.

An efficient mechanism for mode seeking may use a variational Expectation-Maximization algorithm or a variational generalized Expectation-Maximization algorithm, and may use two types of approximation hierarchies: a hierarchy of dynamic data points that may be moved towards the modes or clusters, and a hierarchy of kernels in an underlying fixed kernel density function. In the hierarchies, statistics may be stored for portions of the data and the kernels so that a learning algorithm may reason from the portions of data rather than the individual data points.

The mechanism may allow for homoscedastic and heteroscedastic data models, and may further enable mean-shifting, mediod-shifting, and quick-shifting techniques to be applied to the data.

Throughout this specification, like reference numbers signify the same elements throughout the description of the figures.

When elements are referred to as being “connected” or “coupled,” the elements can be directly connected or coupled together or one or more intervening elements may also be present. In contrast, when elements are referred to as being “directly connected” or “directly coupled,” there are no intervening elements present.

The subject matter may be embodied as devices, systems, methods, and/or computer program products. Accordingly, some or all of the subject matter may be embodied in hardware and/or in software (including firmware, resident software, micro-code, state machines, gate arrays, etc.) Furthermore, the subject matter may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The computer-usable or computer-readable medium may be for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media.

Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and may be accessed by an instruction execution system. Note that the computer-usable or computer-readable medium can be paper or other suitable medium upon which the program is printed, as the program can be electronically captured via, for instance, optical scanning of the paper or other suitable medium, then compiled, interpreted, of otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” can be defined as a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above-mentioned should also be included within the scope of computer-readable media.

When the subject matter is embodied in the general context of computer-executable instructions, the embodiment may comprise program modules, executed by one or more systems, computers, or other devices. Generally, program modules include routines, programs, objects, components, data structures, and the like, that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.

FIG. 1 is a diagram of an embodiment 100, showing a system that may identify clusters within a data set. Embodiment 100 is a simplified example of a computer system that may process a data set using a variational mode seeking analysis mechanism and produce clusters from the data set.

The diagram of FIG. 1 illustrates functional components of a system. In some cases, the component may be a hardware component, a software component, or a combination of hardware and software. Some of the components may be application level software, while other components may be operating system level components. In some cases, the connection of one component to another may be a close connection where two or more components are operating on a single hardware platform. In other cases, the connections may be made over network connections spanning long distances. Each embodiment may use different hardware, software, and interconnection architectures to achieve the described functions.

Embodiment 100 illustrates a device 102 that may be used to perform clustering. The device 102 may be a conventional computer device that may operate software components that may perform clustering operations. In other embodiments, the operations of the software components may be embodied in a gate array, programmable logic array, or other hardware component. The architecture of embodiment 100 is merely one example of a mechanism that may implement the mode seeking mechanism described below.

Device 102 may have a hardware platform 104 and various software components 106. The hardware platform 104 may comprise a processor 108, random access memory 110, and nonvolatile storage 112. The device 102 may also have a user interface 114 and a network interface 116.

In many embodiments, the device 102 may be a personal computer or code development workstation. The device 102 may also be a server computer, desktop computer, or comparable device. In some embodiments, the device 102 may still also be a laptop computer, netbook computer, tablet or slate computer, wireless handset, cellular telephone, game console, or any other type of computing device.

The hardware components 104 may include a processor 108, random access memory 110, and nonvolatile storage 112. The hardware components 104 may also include a user interface 114 and network interface 116. The processor 108 may be made up of several processors or processor cores in some embodiments. The random access memory 110 may be memory that may be readily accessible to and addressable by the processor 108. The nonvolatile storage 112 may be storage that persists after the device 102 is shut down. The nonvolatile storage 112 may be any type of storage device, including hard disk, solid state memory devices, magnetic tape, optical storage, or other type of storage. The nonvolatile storage 112 may be read only or read/write capable.

The user interface 114 may be any type of hardware capable of displaying output and receiving input from a user. In many cases, the output display may be a graphical display monitor, although output devices may include lights and other visual output, audio output, kinetic actuator output, as well as other output devices. Conventional input devices may include keyboards and pointing devices such as a mouse, stylus, trackball, or other pointing device. Other input devices may include various sensors, including biometric input devices, audio and video input devices, and other sensors.

The network interface 116 may be any type of connection to another computer. In many embodiments, the network interface 116 may be a wired Ethernet connection. Other embodiments may include wired or wireless connections over various communication protocols.

The software components 106 may include an operating system 118 on which various applications and services may operate. An operating system may provide an abstraction layer between executing routines and the hardware components 104, and may include various routines and functions that communicate directly with various hardware components.

The software components 106 may include a clustering mechanism 120 which may process a data set 122 to create cluster definitions 124. The clustering mechanism 120 may further utilize a fixed kernel partition 126 and a dynamic data partition 128 in order to perform the clustering in a computational efficient way.

The clustering mechanism 120 may execute an algorithm that may find clusters within a data set 122. The clustering mechanism may use a mode seeking algorithm, examples of which are defined below. The clustering mechanism 120 may use an expectation-maximization (EM) algorithm view on the mode-seeking clustering, where the Expectation step may generalize or approximate the Expectation step by breaking the product space of data and kernels into chunks, and processing these chunks as a whole. Such a mechanism may lower computational time while sacrificing a minimum of accuracy.

Mode Seeking

Let μ=(μ₁, . . . , μ_(M)) denote a set of fixed data points for which the values cannot change, and let x=(x₁, . . . , x_(N)) denote a set of dynamic data points that are allowed to change (or shift) their values (or positions).

Mode-seeking clustering algorithms form a class of non-parametric clustering (or segmentation) algorithms that conceptually work as follows:

(1) Given the fixed μ₁, . . . , μ_(M), create a copy to be used as the dynamic x₁, . . . , x_(N). (Hence, M=N.)

(2) For each dynamic data point x_(n) iteratively move this data points uphill towards a mode in the underlying kernel density function, defined as the finite mixture

P(x _(n))=Σ_(m=1) ^(M) p(m)p(x _(n)|μ_(m),Σ_(m)),  (1)

where p(m) is the weight (or mixing proportion) for kernel m, and

${p\left( {\left. x_{n} \middle| \mu_{m} \right.,\sum\limits_{m}} \right)} = {\frac{1}{Z_{m}}{K\left( {D_{\sum\limits_{m}}\left( {x_{n},\mu_{m}} \right)} \right)}}$

is a normalized Gaussian kernel function with profile

${{K(z)} = ^{{- \frac{1}{2}}z}},$

location μ_(m), bandwidth Σ_(m), normalization Z_(m), and

${D_{\sum\limits_{m}}\left( {x_{n},\mu_{m}} \right)} = {\left( {x_{n} - \mu_{m}} \right)^{T}{\underset{m}{\sum\limits^{- 1}}\left( {x_{n} - \mu_{m}} \right)}}$

is the Mahalanobis distance. For mode seeking algorithms it is standard that

${p(m)} = {\frac{1}{M}.}$

(3) Assign μ_(i) and μ_(i) to the same cluster if their associated dynamic points x_(i) and x_(j) have moved to the same mode for the kernel density function.

Mean-Shift & its EM Interpretation

Mean-shift is one version of a mode-seeking algorithm. The mean shift version of the mode seeking clustering algorithm may operate within the framework of the Expectation-Maximization (EM) algorithm. Conceptually, in this view, instead of maximizing the log-likelihood with respect to the parameters in the mixture model defined by the kernel density estimate, a log-likelihood is maximized with respect to each of the dynamic data points. That is, the EM view iteratively maximizes the log-likelihood

l(x _(n))=log p(x _(n))=log Σ_(m=1) ^(M) p(m)p(x _(n)|μ_(m),Σ_(m))  (2)

independently for each dynamic point x_(n), n=1, . . . , N, where each iteration consists of the following two steps:

E-step: Compute the posterior mixture-component membership probabilities p(m|x_(n)) for m=1, . . . , M.

M-step: Maximize E_(p(m|x) _(n) )[log(p(m)p(x|μ_(m), Σ_(m)))] with respect to x; where E_(p(m|x) _(n) ₎[.] denotes the expectation with respect to the distribution p(m|x_(rt)), m=1, . . . , M. The maximization leads to the following update for the individual d-dimensional data points

$\begin{matrix} {\begin{matrix} \left. x_{n}\leftarrow {{\arg \; {\max_{x \in R^{d}}{\sum\limits_{m = 1}^{M}{{p\left( m \middle| x_{n} \right)}\log \; {p\left( x \middle| m \right)}}}}} + C} \right. \\ {= {\arg \; {\min_{x \in R^{d}}{\sum\limits_{m = 1}^{M}{{p\left( m \middle| x_{n} \right)}{D_{\sum\limits_{m}}\left( {x,\mu_{m}} \right)}}}}}} \\ {= {\left( {\sum\limits_{m = 1}^{M}{{p\left( m \middle| x_{n} \right)}\sum\limits_{m}^{- 1}}} \right)^{- 1}{\sum\limits_{m = 1}^{M}{{p\left( m \middle| x_{n} \right)}{\sum\limits_{m}^{- 1}\mu_{m}}}}}} \end{matrix}{where}\begin{matrix} {{C = {\sum\limits_{m = 1}^{M}{{p\left( m \middle| x_{n} \right)}\log \; {p(m)}\mspace{14mu} {is}\mspace{14mu} {independent}\mspace{14mu} {of}\mspace{14mu} x}}},{{and}\mspace{14mu} {p\left( m \middle| x_{n} \right)}}} \\ {= {\frac{p\left( x_{n} \middle| m \right)}{{Mp}\left( x_{n} \right)}{for}\mspace{14mu} {p(m)}}} \\ {= {\frac{1}{M}.}} \end{matrix}} & (3) \end{matrix}$

The updated x_(n) is now used for the next iteration of EM, and this process continues until convergence.

Medoid-Shift & its (generalized) EM Interpretation

The EM interpretation for mean-shift opens up for many mode-seeking variations, because the partitioning of each iteration into two steps (the E- and M-steps) offers a flexibility that is conceptually easy to understand. The medoid-shift algorithm may be a variation of the mean-shift algorithm, where the M-step in the above EM interpretation may be modified by computing the new position of x_(n) as the weighted

$\begin{matrix} \begin{matrix} \left. x_{n}\leftarrow {{\arg \; {\max\limits_{{x = \mu_{m}},{m = 1},\; \ldots \mspace{11mu},M}{\sum\limits_{m = 1}^{M}{{p\left( m \middle| x_{n} \right)}\log \; {p\left( x \middle| m \right)}}}}} + C} \right. \\ {= {\arg \; {\max\limits_{{x = \mu_{m}},{m = 1},\; \ldots \mspace{11mu},M}{\sum\limits_{m = 1}^{M}{{p\left( m \middle| x_{n} \right)}{{D_{\sum\limits_{m}}\left( {x,\mu_{m}} \right)}.}}}}}} \end{matrix} & (4) \end{matrix}$

By restricting the optimization to only allow the update for x_(n) to assign values from the fixed data points μ, the M-step will increase the unrestricted log-likelihood but not necessarily maximize it, and therefore this interpretation of the medoid-shift is a generalized EM interpretation.

Quick-Shift & its (Generalized) EM Interpretation

By realizing that it is not necessary to minimize (4) in the M-step of the medoid-shift algorithm in order to obtain a convergent (generalized) EM algorithm, we can define a quick-shift algorithm, which in spirit is similar to other quick-shift algorithms. This quick-shift algorithm will simply assign the nearest neighbor that decreases (and not necessarily minimizes) the value in (4) to the updated value for x_(n).

Variational Mode Seeking

The variational mode seeking algorithm may use any of the mode-seeking algorithms that can be explained in the (generalized) EM view with different M-step variations. A variational distribution may be introduced as:

q(m|n), m=1, M and n=1, . . . , N  (5)

The posterior mixture-component membership probabilities p(m|x_(n)) may be approximated as q(m|n) in the E-step as described below. Using this approximation, the algorithm is still convergent, but will converge to a lower bound of the log-likelihood, where the tightness of this bound is determined by the quality of the approximation.

In the variational mode-seeking approach, each dynamic data point may not be considered individually, but instead the joint log-likelihood of the independent data points will be considered:

$\begin{matrix} \begin{matrix} {{l(x)} = {l\left( {x_{1},\ldots \mspace{14mu},x_{n}} \right)}} \\ {= {\sum\limits_{n = 1}^{N}{\log \; {p\left( x_{n} \right)}}}} \\ {{= {\sum\limits_{n = 1}^{N}{\log \; {\sum\limits_{m = 1}^{M}{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}}}}}\;} \end{matrix} & (6) \end{matrix}$

Introducing the variational distribution q(m|n) and using Jensen's inequality

$\begin{matrix} {\begin{matrix} {{l(x)} = {\sum\limits_{n = 1}^{N}{\log \; {\sum\limits_{m = 1}^{M}{{q\left( m \middle| n \right)}\frac{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}{q\left( m \middle| n \right)}}}}}} \\ {\geq {\sum\limits_{n = 1}^{N}{\sum\limits_{m = 1}^{M}{{q\left( m \middle| n \right)}\log \frac{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}{q\left( m \middle| n \right)}}}}} \\ {= {{F\left( {q,x} \right)}.(8)}} \end{matrix}{And}} & (7) \\ {{{{l(x)} - {F\left( {q,x} \right)}} = {{KL}\left\lbrack {{q\left( {\cdot \left. n \right)} \right.}\left. {{p\left( \cdot  \right.}x_{n}} \right)} \right\rbrack}},} & (9) \end{matrix}$

where KL[·|51 ·] denotes the Kullback-Leibler divergence between the two distributions in the arguments.

Equation (9) can also be realized from the following derivation

$\begin{matrix} \begin{matrix} {{{l(x)} - {F\left( {q,x} \right)}} = {{\sum\limits_{n = 1}^{N}{\log \; {\sum\limits_{m = 1}^{M}{{q\left( m \middle| n \right)}\frac{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}{q\left( m \middle| n \right)}}}}} -}} \\ {{\sum\limits_{n = 1}^{N}{\sum\limits_{m = 1}^{M}{{q\left( m \middle| n \right)}\log \frac{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}{q\left( m \middle| n \right)}}}}} \\ {= {{\sum\limits_{n = 1}^{N}{\sum\limits_{m = 1}^{M}{{q\left( m \middle| n \right)}\log {\sum\limits_{m = 1}^{M}{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}}}}} -}} \\ {{\sum\limits_{n = 1}^{N}{\sum\limits_{m = 1}^{M}{{q\left( m \middle| n \right)}\log \frac{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}{q\left( m \middle| n \right)}}}}} \\ {= {\sum\limits_{n = 1}^{N}{\sum\limits_{m = 1}^{M}{{q\left( m \middle| n \right)}\begin{pmatrix} {{\log {\sum\limits_{m = 1}^{M}{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}}} -} \\ {\log \frac{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}{q\left( m \middle| n \right)}} \end{pmatrix}}}}} \\ {= {\sum\limits_{n = 1}^{N}{\sum\limits_{m = 1}^{M}{{q\left( m \middle| n \right)}\log \frac{q\left( m \middle| n \right)}{\frac{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}{\sum\limits_{m = 1}^{M}{{p(m)}{p\left( {\left. x_{n} \middle| \mu_{m} \right.,\Sigma_{m}} \right)}}}}}}}} \\ {= {{{KL}\left\lbrack {{q\left( {m\left. n \right)} \right.}\left. {{p\left( m \right.}x_{n}} \right)} \right\rbrack}.}} \end{matrix} & (10) \end{matrix}$

An iteration in the variational EM view on mode-seeking algorithms performs coordinate ascent on F(q, x) by the following two steps

E-step: q^(t+1)=arg max_(q) F (q, x^(t))=arg min_(q)KL[q(m|n)∥p(m|x_(n) ^(t))]

M-Step:

For Mean-shift:

$\begin{matrix} {x^{t + 1} = {\arg \; {\max\limits_{{x\text{:}x_{n}} \in {R^{d}{\forall n}}}{F\left( {q^{t + 1},x} \right)}}}} & (11) \end{matrix}$

leading to the following update for individual data points

For Medoid-Shift:

$\begin{matrix} {x^{t + 1} = {\arg \; {\max\limits_{{x\text{:}x_{n}} \in {\mu \; {\forall n}}}{F\left( {q^{t + 1},x} \right)}}}} & (12) \end{matrix}$

leading to the following update for individual data points

$\begin{matrix} {x_{n}^{t + 1} = {\arg \; {\min\limits_{x \in \mu}{\sum\limits_{m = 1}^{M}{{q\left( m \middle| n \right)}{d_{\Sigma_{m}}\left( {x,\mu_{m}} \right)}}}}}} & (13) \end{matrix}$

For Quick-shift: Similar to Medoid-shift, where the update for each individual data points is the nearest neighbor that decreases (and not necessarily minimizes) the value in (13).

Since the Kullback-Leibler divergence is always positive, q(m|In)=p(m|x_(n)) maximizes the E-step, and, in this case, the variational (generalized) EM algorithm is equivalent to the standard (generalized) EM algorithm.

In some embodiments, q(m|n) may be restricted to attain significant speedup in computational complexity over the standard mode-seeking algorithms, implying that KL[q(·|n)∥p(·|x_(n))] may not be zero, because p(·|x_(n)) is not in the allovable class of distributions for q(·|n). However, the E-step will minimize this KL divergence.

Data and Kernel Partition

Let us define a hierarchical partition for the fixed data points μ=, μ₁, . . . , μ_(m) and similar for the dynamic data points x=x₁, . . . , x_(N). Note that a kernel may be associated with each fixed data point, as indicated in Equation (1). The terms fixed data point and kernel are used interchangeably.

FIG. 2 illustrates an embodiment 200 showing a fixed kernel partition tree 202 and a dynamic data partition tree 204. We assume a partition of x×μ into into blocks {(A_(i), B₁), . . . , (A_(p), B_(p))}, where A_(p) ⊂x and B_(p) ⊂μ for p=1, . . . , P.

In FIG. 2, A_(p) and B_(p) correspond to nodes in, respectively, the dynamic data and the fixed kernel partition trees, and (A_(p), B_(p)) corresponds to a block (a square) in the partition table 206.

Computational efficiency may be increased by maintaining two partition trees: one for the data points that are shifted towards the modes and one for the kernels in the underlying fixed kernel density function. The two trees may be known respectively as a data partition tree and a kernel partition tree. A node in one of these trees may represent the union of the elements, either data points or kernels, for the elements represented by the node's children. Individual elements may not actually be stored in each node, but rather, the sufficient statistics for the estimation operations may be pre-computed and stored in a tree.

Any tree may be suitable for constructing a partition tree so long as there may be some degree of similarity between the data or kernels in a block.

The trees may represent computationally cheap, but rough, hierarchical clusterings of data and kernels. The clusterings may allow groups of data and kernels represented by a blocks in the partition table to be reasoned about in subsequent analyses.

An anchor tree construction may assume that all distance computations during the construction are performed according to the same isotropic distance metric. Such an assumption may be appropriate for constructing a data partition tree, because data simply specify a location in a given data space. Such a tree may be known as a homoscedastic anchor tree. In contrast to data, each kernel may additionally impose an individual distance metric, which may be included in the kernel partition tree. For homoscedastic kernels, all distance metrics may be the same.

The anchor tree construction algorithm may start with an anchor growing phase that for N elements gradually grows a set of √{square root over (N)} anchors A, which will become the leafs in an anchor sub-tree. Each anchor a εA has a pivot element a_(p) and may maintain a list of individual member elements a_(m)={a_(M) ^(i)}_(i), which may be data points or kernels sorted by decreasing order of distance to the pivot D(a_(M) ^(i), a_(p)), a_(M) ^(i)εa_(M). The list of elements in an anchor may have a characteristic that all elements in the anchor are closer to the pivot of that anchor than the pivot of any other anchor.

During the anchor growing phase, a new anchor a^(new) may be added to the current set of anchors and its pivot element may be chosen as the element with largest distance to the pivot of the anchor that currently owns it. That is,

$\begin{matrix} {a^{new} = {\underset{{a_{M}^{i} \in a_{M}},{a \in M}}{\arg \; \max}{D\left( {a_{M}^{i},a_{p}} \right)}}} & (14) \end{matrix}$

The new anchor may now iterate through the member elements of current anchors and may “steal” elements with D(a_(M) ^(i),a_(p) ^(new))<D(a_(M) ^(i),a_(p)). Because the list of elements in an anchor may already be sorted with respect to D (a_(M) ^(i),a_(p)) computational efficiency may be achieved by not evaluating the remaing elements in the list once we discover that D(a_(M) ^(i),a_(p))<D(a_(p) ^(new),a_(p))/2, which guarantees that these elements cannot be stolen. Once a new anchor is done stealing elements from older anchors, its list of elements is sorted and no further sorting is needed for this anchor.

The anchor sub-tree construction now contains a phase that recursively merges the pair of anchors that results in the smallest distance from the pivot for the element farthest away. The leafs or anchors in the tree now contains √{square root over (N)} elements on average and we can subdivide further by recursively calling the sub-tree construction algorithm on the data in each leaf.

The Heteroscedastic Anchor Tree

Recall the normalized kernel function for data point x

$\begin{matrix} {{K_{\mu,\Sigma}(x)} = {\frac{1}{Z_{\Sigma}}{K\left( {D_{\Sigma}\left( {x,\mu} \right)} \right)}}} & (15) \end{matrix}$

where μ and Σ are the kernel location and bandwidth matrix, respectively, Z_(Σ) is a normalization constant only depending on Σ, and K may be the profile depending on the Mahalanobis distance

D _(Σ)(x,μ)=(x−μ)^(T)Σ⁻¹(x−μ)  (16)

When constructing the kernel tree, information about data are not be available. In order to group kernels in a way to reason about the groups, kernels may be grouped according to distance in location. For the distance computations between two kernels a^(i)=K_(μ) _(i) _(,Σ) _(i) and a_(j)=K_(μ) _(j) _(,Σ) _(j) use the bandwidth that will imply the most variability on data far away. That is,

D(a ^(i) ,a ^(j))=min {D _(Σ) _(j) (μ_(i) ,μj),D _(Σ) _(i) (μ_(j),μ_(i))}  (17)

We will assume that that kernel heteroscedacity stems from differently scaled bandwidths. For an arbitrary kernel with bandwidth Σ, the scaled bandwidths in the different kernels can now be represented as Σ_(i)=a_(i)Σ. Hence,

${D\left( {a^{i},a^{j}} \right)} = {\frac{1}{\alpha \left( {a_{i},a_{j}} \right)}{D_{\Sigma}\left( {a^{i},a^{j}} \right)}}$

where α(a_(i), a_(i))=max {a_(i), a_(j)} and D_(Σ)(·,·) is the distance computed with respect to Σ.

The heteroscedastic anchor tree construction may proceed as for the homoscedastic case with this new distance measure and a modification to the criterion for when to stop traversing the sorted list of elements in an anchor. The property of FIG. 3 may be used to derive this criterion.

Notice that in the case of homoscedastic kernels, the conditioning statement in the property of FIG. 3 reduces to the criterion D(a_(M) ^(i),a_(p))<D(a_(p) ^(new),a_(p))/2 that guarantees that the remaining elements in a_(M) cannot be stolen once the condition is satisfied for a_(M) ^(i). This guarantee follows directly by recalling that a_(M) may be sorted with respect to distance from a_(p) (high to low), and therefore D(a_(p),a_(M) ^(i+k))≦D (a_(p),a_(M) ^(i)) for k>0.

However, for heteroscedastic kernels, the dependence on a_(M) ^(i) (and a_(M) ^(i+k) for following elements) in the conditioning statement of the property in FIG. 3 has the consequence that we cannot assume that this condition is satisfied for a_(M) ^(i+k) once it is satisfied for a_(M) ^(i). The problem is that for some a_(M) ^(i+k), the scaling factor α(a_(p) ^(new),a_(M) ^(i+k)) may be greater than the factor for a_(M) ^(i). We can fix this problem by a slight modification to the criterion, as follows. Let α denote the largest α used for scaling the kernels in a_(M)∪{a_(p) ^(new)}. It is now guaranteed that the remaining elements in a_(M) cannot be stolen once we observe that

$\begin{matrix} {{D\left( {a_{p},a_{M}^{i}} \right)} < {\frac{1}{2\; \overset{\_}{\alpha}}{D_{\Sigma}\left( {a_{p},a_{p}^{new}} \right)}}} & (18) \end{matrix}$

The heteroscedastic criterion is slightly less efficient than the homoscedastic criterion, because α≦1 (with equality in the homoscedastic case). Fortunately, α decreases as the number of anchors increases, and, the savings by not having to evaluate all elements in an anchor may be significant when there are many anchors, because most of the old anchors discover immediately that none or only few of their elements can be stolen by a new anchor under construction.

Another reason for the slightly less efficient heteroscedastic criterion is the answer to how we efficiently keep track of ā. We will in addition to maintaining the sorted list of elements a_(M) also be maintaining a sorted (from high to low) list of scaling factors α_(M). Each element a_(M) ^(i)εa_(M) may be linked to its corresponding a_(M) ^(j)εa_(M), so when an element a_(M) ^(i) is stolen from a_(M), the linked a_(M) ^(j) is stolen from a_(M) as well. Hence, α may be efficiently found as the greater of the scaling factors for the pivot in the new anchor under construction and the first element in α_(M) for the old anchor currently being investigated.

The agglomeration of anchors into the anchor sub-tree may now be considered. The tree is constructed bottom-up by letting each anchor represent a leaf or node in the tree and then recursively merging the pair of most compatible nodes into a parent node until only one root node is left.

The compatibility of two nodes may be defined as follows. Notice that by construction, all element in an anchor are within a guaranteed distance of D_(Σ)(a_(p),a_(M) ¹)=α(a_(p),a_(M) ¹)D (a_(p),a_(M) ¹). In other words, in the metric space with a Mahalanobis distance metric, all elements in an anchor are enclosed in a ball that is centered at a_(p) and has Mahalanobis radius D_(Σ)(a_(p),a_(M) ¹). The compatability of two nodes is now measured by the radius of the smallest ball that completely contains the balls for the two nodes, and the most compatible nodes are those resulting in the smallest radius. By observing Mahalanobis distances along this line, it is a simple geometric argument that a line between the furthest apart points on two balls passes through the center of the balls. It therefore follows that the enclosing ball will have characteristics as defined by the property defined in FIG. 4.

The Block Constrained Variational Distribution

Given the data and kernel partition trees, and a mutually exclusive and exhaustive partition for the product space of data and kernels into blocks {(A₁, B₁), . . . , (A_(P), B_(P))}, where A_(P) is a part of the data and B_(p) is a part of the kernels, corresponding to nodes in the two Respective trees—as previously defined in FIG. 2. This partition may group data and kernels together for which kernels are unable to differentiate between the data in any significant way. Our variational distribution q will then for all the data in a block assign the same variational membership for all the mixture components defined by the kernels in the block. That is,

q(m|n)=q(B _(p) |A _(p)) for all (n,m)ε(A _(p) ,B _(p))  (18)

where (n, m)ε(A_(p), B_(p)) is a notational convenient representation for (x_(n), K_(μ) _(m) _(Σ) _(m) )ε(A_(p), B_(p)). FIG. 2 illustrates a block partition where, for example, q(1|4)=q(1|5)=q(2|4)=q(2|5).

The variational distributions q(·|n) are standard probability distributions and must therefore satisfy the sum-to-one constraint Σ_(m)q(m|n)=1. The positivity constraints q(m|n)>0 are automatically satisfied during estimation of q and are therefore not treated as explicit constraints. By integrating the block constraints in (18) into these standard sum-to-one constraints, we obtain the following N constraints

Σ_(p;nεAp) |B _(p) |q(B _(p) |A _(p))=1 for all nε{1, . . . , N}  (19)

The constraint associated with data point n involves exactly the q(B_(p)|A_(p)) for which nεA_(p) in the block partition. In FIG. 2, the variational probabilities involved in a constraint are therefore those q(B_(p)|A_(p)) encountered on the column under a data partition leaf in the block partition table. For example, for the data partition leaf corresponding to data point n=1, the corresponding constraint is

q(1|1)+q(2|1)+q(3|1)+3q(4−6|1−2)+3q(7−9|1−3)=1  (19)

where q(m₁−m₂|n₁−n₂) is the probability assigned to the block with kernels m₁ through m₂ and data points n₁ through n₂.

The constraints in (19) complicates the optimization of q in the E-step, because they are intertwined in the sense that some variational probabilities involved in one constraint may also appear as a part of another constraint. Fortunately, this complication can be alleviated, because it turns out that it is possible to represent the block constraints in the form of a so-called marked partition tree or MPT.

FIG. 5 illustrates an example embodiment 500 showing a marked partition tree (MPT). A MPT is similar to the data partition tree, but with an additional encoding of data partitions that relates to specific components in a mixture model, as follows. If A_(p) is a data block in the partition for mixture component m, then the corresponding node in the tree is marked by an identifier for that component. Now, recall that the kernels in the kernel density estimate for each data point correspond to the components in a mixture model. We, accordingly, create a MPT from a block partition by simply marking the node in the data partition tree corresponding to A_(p) with the mark B_(p) for each block (A_(p), B_(p)) in the partition. For example, for the block partition in FIG. 2 the MPT will appear as in FIG. 5. The MPT of FIG. 5 has a group of kernels in a block to be marked together instead of insisting on individual marks for each mixture component defined by the kernels in the block. For example, the marks B_(p)ε{1−2,7−9} may be used instead of the marks B_(p)ε{1,2,7,8,9} for A_(p)=4−5 in FIG. 5. This difference may add to computational efficiency, as we will see later.

By construction, each path from a leaf n to the root in the MPT has the property that it marks all B_(p) for which nεA_(p) and (A_(p),B_(p)) is in the block partition. The MPT therefore also defines the constraints in (19). However, the MPT representation may have an additional benefit of representing the constraints in an explicit tree structure that is exploited during the efficient estimation of q in the E-step.

As a final caveat, it is in fact possible that the lowest marked nodes in the tree are not leaf nodes. If A_(p) is such a node, all constraints in (19) for nεA_(p) will be redundant. In this case the MPT construction process proceeds by pruning all unmarked nodes without marked descendants from the tree, which in effect reduces the number of constraints in (19) to the lower number of non-redundant constraints associated with paths from leaves to the root in the reduced tree.

Block Selection

The (A_(p),B_(p)) blocks in FIG. 2 may be constructed by a recursive traversal of the data and kernel partition trees. Let a denote a node in the data partition tree and b denote a node in the kernel partition tree. Starting from a and b being the roots of their respective trees, the blocks may be constructed by a call to the recursive function of FIG. 6.

Here Block (a, b) creates one of the (A_(p), B_(p)) blocks in FIG. 2.

The CanBlock(a, b, γ) is a criterion which determines if a and b are far enough apart that the group of kernels in b is unable to differentiate between a group of data in a to a certain degree determined by a user-specified parameter γ.

Letting α and α denote the smallest and largest factor used to scale the reference bandwidth Σ for the kernels in node b. The distance from any kernel in b to any data point in a (from the view point of b) cannot be smaller than

$\begin{matrix} {{D_{\min}\left( {a,b} \right)} = {\max \left\{ {0,{{\frac{1}{\overset{\_}{\alpha}}{D_{\Sigma}\left( {a_{p},b_{p}} \right)}} - {\frac{1}{\underset{\_}{\alpha}}{D_{\Sigma}\left( {a_{p},a_{M}^{1}} \right)}} - {D\left( {b_{p},b_{M}^{1}} \right)}}} \right\}}} & (20) \end{matrix}$

and the distance cannot be larger than

$\begin{matrix} {{D_{\max}\left( {a,b} \right)} = {{\frac{1}{\underset{\_}{\alpha}}{D_{\Sigma}\left( {a_{p},b_{p}} \right)}} + {\frac{1}{\underset{\_}{\alpha}}{D_{\Sigma}\left( {a_{p},a_{M}^{1}} \right)}} + {D\left( {b_{p},b_{M}^{1}} \right)}}} & (21) \end{matrix}$

A CanBlock criterion can now be defined as a threshold for the relative distance between kernel evaluations based on the smallest and largest possible distances from b to a

$\begin{matrix} {\frac{{\frac{1}{Z_{\Sigma_{\min}}}K\left( D_{\min} \right)} - {\frac{1}{Z_{\Sigma_{\max}}}{K\left( D_{\max} \right)}}}{\frac{1}{Z_{\Sigma_{\max}}}{K\left( D_{\max} \right)}} < \gamma} & (22) \end{matrix}$

where Σ_(min) and Σ_(rna), denote the unknown bandwidths leading to the smallest and largest distances. For the heteroscedastic Gaussian kernels the criterion becomes

$\begin{matrix} {\frac{{\left( \frac{\Sigma_{\max}}{\Sigma_{\min}} \right)^{{- 1}/2}{K\left( D_{\min} \right)}} - {K\left( D_{\max} \right)}}{K\left( D_{\max} \right)} < \gamma} & (23) \end{matrix}$

By the restriction of heteroscedasity to scaled bandwidths, Σ_(min)=α_(min)Σ and Σ_(max)=α_(max)Σ. The unknown factor in (23) can therefore be bounded as follows

$\begin{matrix} {\left( \frac{\Sigma_{\max}}{\Sigma_{\min}} \right)^{{- 1}/2} = {{\left( \frac{\alpha_{\max}}{\alpha_{\min}} \right)^{{- d}/2}{\Sigma }^{{- 1}/2}} \leq {\left( {\underset{\_}{\alpha}/\overset{\_}{\alpha}} \right)^{- \frac{d}{2}}{\Sigma }^{- \frac{1}{2}}}}} & (24) \end{matrix}$

Using this bound instead of the unknown factor in (23) guarantees that the threshold γ for the CanBlock criterion will not be violated.

For homoscedastic kernels α= α=1, and the criterion in this case simplifies to

$\begin{matrix} {\frac{{K\left( D_{\min} \right)} - {K\left( D_{\max} \right)}}{K\left( D_{\max} \right)} < \gamma} & (25) \end{matrix}$

Efficient Variational E-Step

With the block constraints in (18) on the variational distribution, the decomposition of F(x, q) in (7) reduces to

$\begin{matrix} {{{F\left( {x,q} \right)} = {\sum\limits_{p = 1}^{P}{{A_{p}}{B_{p}}{{q\left( {B_{p}\text{|}A_{p}} \right)}\left\lbrack {{{- \log}\; {q\left( {B_{p}\text{|}A_{p}} \right)}} + {H\left( B_{p} \right)} + {G\left( {B_{p}\text{|}A_{p}} \right)}} \right\rbrack}}}}\mspace{79mu} {{where},{{{for}\mspace{14mu} {p(m)}} = {1/M}},}} & (26) \\ {\mspace{79mu} {{{H\left( B_{p} \right)} = {{\frac{1}{B_{p}}{\sum\limits_{m \in B_{p}}{\log \; {p(m)}}}} = {- {\log (M)}}}}\mspace{20mu} {and}}} & (27) \\ {\mspace{20mu} {{G\left( B_{p} \middle| A_{p} \right)} = {\frac{1}{{A_{p}}{B_{p}}}{\sum\limits_{n \in B_{p}}{\sum\limits_{m \in B_{p}}{\log \; {p\left( {{x_{n}\text{|}\mu_{m}},\sum\limits_{m}} \right)}}}}}}} & (28) \end{matrix}$

Notice that the block-specific geometric mean G(A|B) is the only part of F(x, q) that depends on specific data and kernel parameters.

The constraints in (19) may then be integrated into the maximization of F(x, q) with respect to q by use of the Lagrange multiplier method. Introducing the Lagrange multipliers λ=(λ₁, . . . , λ_(N)) for the constraints, we construct the Lagrangian

$\begin{matrix} {{F\left( {x,q,\lambda} \right)} = {{F\left( {x,q} \right)} + {\sum\limits_{n = 1}^{N}{\lambda_{n}\left( {{\sum\limits_{p:{n \in A_{p}}}{{B_{p}}{q\left( B_{p} \middle| A_{p} \right)}}} - 1} \right)}}}} & (29) \end{matrix}$

By setting

$\begin{matrix} {\frac{\partial{F\left( {x,q,\lambda} \right)}}{\partial\left( {B_{p}A_{p}} \right)} = 0} & \; \end{matrix}$

and solving all for q (B_(p)|A_(p)) we obtain the solution

$\begin{matrix} {{q_{\lambda}\left( {B_{p}\text{|}A_{p}} \right)} = {{\exp\left( {{\frac{1}{A_{p}}{\sum\limits_{p:{n \in A_{p}}}\lambda_{n}}} - 1} \right)}{\exp \left( {H\left( B_{p} \right)} \right)}{\exp \left( {G\left( {B_{p}\text{|}A_{p}} \right)} \right)}}} & (30) \end{matrix}$

where the λ subscript in q_(λ)(B_(p)|A_(p)) indicates that this solution is still a function of λ.

A closed-form solution to the constrained optimization problem can now be found by inserting q_(λ)(B_(p)|A_(p)) into the constraints in (19) and then exploit the tree structure of these constraints, as represented by the MPT, to gradually solve for λ. Following, by inserting this λ into (30), we finally obtain the optimal value for the variational probabilities q_(λ)(B_(p)|A_(p)), p=1, . . . ,.

An algorithmic description of the E-step may be found in FIG. 7.

Sufficient Statistics

It is key to the computational efficiency of the variational E-step that sufficient statistics for G (B_(p)|A_(p)) factorizes into data-specific statistics and kernel-specific statistics in this way avoiding computation of statistics on the product space of data and kernels. These statistics can therefore be pre-computed separately and stored at the nodes in the respective partition trees prior to the actual mode-seeking process. For heteroscedastic Gaussian kernels with profile defined on the Mahalanobis distance

$\begin{matrix} \begin{matrix} {{G\left( {B_{p}\text{|}A_{p}} \right)} = {\frac{1}{{A_{p}}{B_{p}}}{\sum\limits_{n \in A_{p}}{\sum\limits_{m \in B_{p}}{\log \; {p\left( {{x_{n}\text{|}\mu_{m}},\sum\limits_{m}} \right)}}}}}} \\ {= {c - {\frac{1}{2}\begin{bmatrix} {{\frac{1}{B_{p}}{\sum\limits_{m \in B_{p}}{\log {\sum\limits_{m}}}}} +} \\ {\frac{1}{{A_{p}}{B_{p}}}{\sum\limits_{n \in A_{p}}{\sum\limits_{m \in B_{p}}{\left( {x_{n} - \mu_{m}} \right)^{T}{\sum\limits_{m}^{- 1}\left( {x_{n} - \mu_{m}} \right)}}}}} \end{bmatrix}}}} \\ {= {c - {\frac{1}{2}\left\lbrack {{\langle{\log {\sum\limits_{m}}}\rangle}_{B_{p}} + {\frac{1}{B_{p}}{\sum\limits_{m \in B_{p}}\begin{pmatrix} {{\mu_{m}^{T}{\overset{- 1}{\sum\limits_{m}}\mu_{m}}} +} \\ {{\langle{x_{n}^{T}{\overset{- 1}{\sum\limits_{m}}x_{n}}}\rangle}_{A_{p}} -} \\ {2\; \mu_{m}^{T}{\overset{- 1}{\sum\limits_{m}}{\langle x_{n}\rangle}_{A_{p}}}} \end{pmatrix}}}} \right\rbrack}}} \\ {= {c - {\frac{1}{2}\begin{bmatrix} {{\langle{\log {\sum\limits_{m}}}\rangle}_{B_{p}} + {\langle{\mu_{m}^{T}{\overset{- 1}{\sum\limits_{m}}\mu_{m}}}\rangle}_{B_{p}} +} \\ {{\frac{1}{B_{p}}{\sum\limits_{m \in B_{p}}{\langle{x_{n}^{T}{\overset{- 1}{\sum\limits_{m}}x_{n}}}\rangle}_{A_{p}}}} - {2{\langle\; {\mu_{m}^{T}\overset{- 1}{\sum\limits_{m}}}\rangle}_{B_{p}}{\langle x_{n}\rangle}_{A_{p}}}} \end{bmatrix}}}} \\ {= {c - {\frac{1}{2}\begin{bmatrix} {{\langle{\log {\sum\limits_{m}}}\rangle}_{B_{p}} + {\langle{\mu_{m}^{T}{\overset{- 1}{\sum\limits_{m}}\mu_{m}}}\rangle}_{B_{p}} +} \\ {{{Tr}\left( {{\langle\overset{- 1}{\sum\limits_{m}}\rangle}_{B_{p}}{\langle{x_{n}^{T}x_{n}}\rangle}_{A_{p}}} \right)} - {2{\langle\; {\mu_{m}^{T}\overset{- 1}{\sum\limits_{m}}}\rangle}_{B_{p}}{\langle x_{n}\rangle}_{A_{p}}}} \end{bmatrix}}}} \end{matrix} & (31) \end{matrix}$

where

$c = {{- \frac{1}{2}}{Nd}\; {\log \left( {2\; \pi} \right)}}$

with d being the number of features in a data point, <·>_(Ap) and <·>_(Bp) denote averages over nεA_(p) and mεB_(p), respectively, and Tr(·) denotes the trace of a matrix. Notice that the sufficient statistics factorizes into data-specific and kernel-specific parts.

For the homoscedastic Gaussian kernels, G (B_(p)|A_(p)) simplifies to

$\begin{matrix} {{G\left( B \middle| A \right)} = {c - {\frac{1}{2}\begin{bmatrix} {{\log {\sum }} + {{Tr}\left( {\sum\limits^{- 1}{\langle{\mu_{m}^{T}\mu_{m}}\rangle}_{B_{p}}} \right)} +} \\ {{{Tr}\left( {\sum\limits^{- 1}{\langle{x_{n}^{T}x_{n}}\rangle}_{A_{p}}} \right)} - {2{\langle\mu_{m}^{T}\rangle}_{B_{p}}{\sum\limits^{- 1}{\langle x_{n}\rangle}_{A_{p}}}}} \end{bmatrix}}}} & (32) \end{matrix}$

Efficient Variational M-Step

The nodes in the MPT will after the E-step in the t′th iteration of the algorithm contain the computed q^(t+1) (B_(p)|A_(p)). Recall that q^(t+1) (m|A_(p))=q^(t+1) (B_(p)|A_(p)) for all mεB_(p). The update for each x_(n) ^(t) can now be efficiently computed by following the path from leaf n to the root in the MPT, while computing the update for the three different variational mode-seeking algorithms as follows.

For heteroscedastic kernels an efficient update for mean shift becomes

$\begin{matrix} {{{x_{n}^{t + 1}\left( {{\sum\limits_{p:{n \in A_{p}}}{{B_{p}}{q^{t + 1}\left( {B_{p}\text{|}A_{p}} \right)}}} < {\overset{- 1}{\sum\limits_{m}} >_{B_{p}}}} \right)}^{1}*{\sum\limits_{p:{n \in A_{p}}}{{B_{p}}{q^{t + 1}\left( {B_{p}\text{|}A_{p}} \right)}}}} < {\overset{- 1}{\sum\limits_{m}}\mu_{m}} >_{B_{p}}} & (32) \end{matrix}$

For homoscedastic kernels this expression simplifies to

$\begin{matrix} {x_{n}^{t + 1} = {\sum\limits_{p:{n \in A_{p}}}\frac{{{B_{p}}{q^{t + 1}\left( {B_{p}\text{|}A_{p}} \right)}} < \mu_{m} >_{B_{p}}}{\sum\limits_{p:{n \in A_{p}}}^{\;}{{B_{p}}{q^{t + 1}\left( {B_{p}\text{|}A_{p}} \right)}}}}} & (33) \end{matrix}$

Notice that the statistics <E_(m) ⁻¹>B_(p), <E_(m) ⁻¹μ_(m)>B_(p), and <μ_(m)>B_(p) have already been pre-computed and stored in the kernel partition tree prior to the mode seeking.

Medoid Shift

For heteroscedastic kernels the efficient variational M-step corresponding to the update in (4) can utilize the following equality

$\begin{matrix} {{\sum\limits_{m = 1}^{M}{{q\left( {m\text{|}n} \right)}{D_{\sum_{m}}\left( {x,\mu_{m}} \right)}}} = {\sum\limits_{p:{n \in A_{p}}}{{B_{p}}{q^{t + 1}\left( {B_{p}\text{|}A_{p}} \right)}\left( {< {\mu_{m}{\overset{- 1}{\sum\limits_{m}}\mu_{m}^{T}}} >_{B_{p}}{{+ {{Tr}\left( {{\langle\overset{- 1}{\sum\limits_{m}}\rangle}_{B_{p}}x^{T}x} \right)}} - 2} < {\mu_{m}^{T}\overset{- 1}{\sum\limits_{m}}} >_{B_{p}}x} \right)}}} & (34) \end{matrix}$

For homoscedastic kernels the equality simplifies as

$\begin{matrix} {{\sum\limits_{m = 1}^{M}{{q\left( {m\text{|}n} \right)}{D_{\sum_{m}}\left( {x,\mu_{m}} \right)}}} = {\sum\limits_{p:{n \in A_{p}}}{{B_{p}}{q^{t + 1}\left( {B_{p}\text{|}A_{p}} \right)}\left( {{Tr}\left( {\sum\limits^{- 1}{< {\mu_{m}\mu_{m}^{T}} >_{B_{p}}{{+ {{Tr}\left( {\sum\limits^{- 1}{x^{T}x}} \right)}} - 2} < \mu_{m}^{T} >_{B_{p}}{\sum\limits^{- 1}x}}} \right)} \right.}}} & (35) \end{matrix}$

Notice again that all statistics have already been pre-computed and stored in the kernel partition tree prior to the mode seeking.

Quick Shift

The M-step in quick shift computes the same type of quantities as the M-step for medoid shift. However, instead of the minimization over all data points (as in (*b)), quick shift only investigates nearest neighbors until a guaranteed nearest neighbor which decreases the value is found. This nearest neighbor search can be efficiently conducted by considering the leaf in the data partition tree that represents x_(n) and then backtrack through the tree. During this backtracking phase the knowledge about the pivot and the furthest distance that any data point covered by a node in the tree can have from its pivot can be used to not consider parts of the tree, where it is guaranteed that a closer candidate to the current best candidate cannot be found.

Notice that for Mediod shift and Quick shift that each dynamic data point is shifted to a location that corresponds to a location of a data point in the initial data set. Once a shift has been computed for all initial data points in the data set, the result of the next shift has already been computed. In such situations, the first iteration is expressly computed, while remaining iterations are implicitly derived from the first iteration.

Non-Euclidian Distances

Above, the unified variational view on mode seeking was presented in a Euclidean setting, with the Mahalanobis distance measure introduced from a proper inner product. This view may be extended to a non-Euclidian setting via a finitely positive semi-definite function κ(·,·): XxX

, where X may be a non-Euclidian input space. The finitely positive definiteness of κ(·,·) ensures the existence of a decomposition into an inner product of its arguments under a feature mapping into a Hilbert space

,φ(·):X→

, i.e.:

κ(x,y)=φ(x)^(T)φ(y)  (36)

Assuming an explicit representation of the mapping φ(·), which will allow the mode-seeking algorithms to be performed in the Hilbert space.

The G (B_(p)|A_(p)) statistics used in the variational mode-seeking algorithms explicitly depends on the induced distance measure in Hilbert space, which can be computed as

D(x,y)=K(x,x)+κ(y,y)−2κ(x,y)  (37)

The bilinearity of the inner product ensures that these statistics factorize into data-specific and kernel-specific parts, which is useful to store these statistics in the two partition trees.

For example, consider the cosine similarity measure

$\begin{matrix} {{\kappa \left( {x,y} \right)} = {{\cos \left( {x,y} \right)} = \frac{x^{T}y}{{x}{y}}}} & (38) \end{matrix}$

where x,y are non-negative vectors. The cosine similarity induces an explicit embedding

${\varphi (x)} = \frac{x}{x}$

into a Hilbert space with a proper distance measure

D(φ(x),φ(y))=2−2κ(x,y)  (39)

The expression for G(B_(p)|A_(p)) now simplifies to:

$\begin{matrix} {{G\left( {B_{p}\text{|}A_{p}} \right)} = {c - \left\lbrack {{1 -} < \frac{x^{T}}{x} > A_{p} < \frac{\mu_{m}}{\mu_{m}} > B_{p}} \right\rbrack}} & (40) \end{matrix}$

by exploiting the explicit mapping into the Hilbert space and the bilinearity of the inner product. Notice that the sufficient statistics used to compute G(B_(p)|A_(p)) again factorizes into data-specific and kernel-specific parts.

FIG. 8 is a flowchart illustration of an embodiment 800 showing a method for variational mode seeking Embodiment 800 is an example method by which a set of data points may be analyzed to determine clusters of data using the processes described above.

Other embodiments may use different sequencing, additional or fewer steps, and different nomenclature or terminology to accomplish similar functions. In some embodiments, various operations or set of operations may be performed in parallel with other operations, either in a synchronous or asynchronous manner. The steps selected here were chosen to illustrate some principles of operations in a simplified form.

In block 802, a hierarchical kernel partition may be created. The kernels that are used in this hierarchical partition are defined in such a way that each kernel's location parameter corresponds to a different data point in the data set.

In block 804, the data set may be copied into a set of dynamic data points. The dynamic data points may be moved during a variational expectation-maximization process, and finally the original data points may be assigned to clusters in accordance with the modes in the underlying kernel density function that the dynamic data points moves to, as follows.

In block 806, an iterative step may begin. The process may iterate to convergence.

A hierarchical partition of the dynamic data points may be generated in block 808, and a block partition may be generated in block 810. The block partition in block 810 may be defined by an iterative procedure that involves both the hierarchical kernel and data partitions. An example of this procedure may be the procedure described in embodiment 600. An example of the blocks in a resulting block partition may be the block partition table 206 of embodiment 200.

Within each of the blocks of the block partition table 206, all of the data points may be considered to be similar enough that they can be processed via pre-computed sufficient statistics through the variational expectation and maximization steps. In this manner, the variational expectation and maximization steps may be performed.

The variational expectation step may be performed in block 812. An example of the expectation step may be found earlier in this specification in the section entitled “Efficient Variational E-step” and an algorithmic description in FIG. 7.

A variational maximization step may be performed in block 814. The maximization step may be performed using mean shift, mediod shift, quick shift, or other maximization steps. Examples of such steps may be found earlier in this specification in the section entitled “Efficient Variational M-step”.

If the convergence has not occurred in block 816, the process may return to block 808 for more analysis.

Once convergence has occurred in block 816, the clustering of the original data points may be determined according to the grouping of associated dynamic data points at modes in the underlying kernel density function in block 818.

The foregoing description of the subject matter has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the subject matter to the precise form disclosed, and other modifications and variations may be possible in light of the above teachings. The embodiment was chosen and described in order to best explain the principles of the invention and its practical application to thereby enable others skilled in the art to best utilize the invention in various embodiments and various modifications as are suited to the particular use contemplated. It is intended that the appended claims be construed to include other alternative embodiments except insofar as limited by the prior art. 

1. A mode seeking method for generating clusters of data points within a given fixed data set, said method comprising: generating a hierarchical partition of kernels, each of said kernels having a fixed location at a data point in said fixed data set; creating a copy of said data set to create a set of dynamic data points; iteratively moving said dynamic data points to approximate kernel density modes by iteratively until convergence performing a first method comprising: generating an initial hierarchical partition of said dynamic data points; generating a block partition in the product space of fixed kernels and dynamic data points, a block in said block partition containing one or more kernels grouped together by said hierarchical partition of said fixed kernels and one or more dynamic data points grouped together by said hierarchical partition of said dynamic data points; performing an expectation step to compute a variational distribution assigning the same variational membership probability to all fixed kernels within a block for all dynamic data points in the same said block from said block partition; performing a variational maximization step that updates said dynamic data points towards said kernel density modes; at convergence, assigning data points to same cluster when corresponding dynamic data points have converged to a same mode.
 2. The method of claim 1, trajectories of said dynamic data points constrained to pass through data points within said fixed data set, said method comprising: explicit computation of shifts for all said dynamic data points in first iteration; and implicit computation of shifts for all said dynamic data point in a subsequent iteration.
 3. The method of claim 1, said block partition in the product space of fixed kernels and dynamic points being created by a recursive method comprising: starting from a root of both said kernel partition tree and said data partition tree, expanding the granularity of either one group of kernels in a current kernel partition or one group of data in a current data partition until a predefined criterion is satisfied.
 4. The method of claim 1, said kernel and data partition trees being anchor trees.
 5. The method of claim 1, said block partition being represented as a marked partition tree.
 6. The method of claim 1, the kernels having Gaussian profiles.
 7. The method of claim 6, the distance measure used in kernels being Mahalanobis distance.
 8. The method of claim 6, the distance measure used in kernels being induced from a finitely positive semi-definite function, and the mode-seeking algorithm performed under an explicit feature mapping into Hilbert space.
 9. The method of claim 8, the finitely positive semi-definite function being cosine similarity defined for non-negative feature vectors.
 10. A device comprising: a processor; executable instructions operable by said processor, said executable instructions comprising a mode seeking method for generating clusters of data points within a given fixed data set, said method comprising: generating a hierarchical partition of kernels, each of said kernels having a fixed location at a data point in said fixed data set; creating a copy of said data set to create a set of dynamic data points; iteratively moving said dynamic data points to approximate kernel density modes by iteratively until convergence performing a first method comprising: generating an initial hierarchical partition of said dynamic data points; generating a block partition in the product space of fixed kernels and dynamic data points, a block in said block partition containing one or more kernels grouped together by said hierarchical partition of said fixed kernels and one or more dynamic data points grouped together by said hierarchical partition of said dynamic data points; performing an expectation step to compute a variational distribution assigning the same variational membership probability to all fixed kernels within a block for all dynamic data points in same said block from said block partition; performing a variational maximization step that updates said dynamic data points towards said kernel density modes; at convergence, assigning data points to same cluster when corresponding dynamic data points have converged to a same mode.
 11. The device of claim 10, at least two of said kernels having differently scaled bandwidths.
 12. The device of claim 10, all of said kernels having the same bandwidth.
 13. The device of claim 10, said hierarchical partition of data and kernels comprising: a data partition tree defining summarizing statistics about groups of said data in partition; a kernel partition tree summarizing defining statistics about said kernels in partitions; and using at least one of these statistics in at least one of the variational E-step and the variational M-step.
 14. The device of claim 13, said statistics comprising at least one summarizing statistic stored in said kernel partition tree.
 15. The device of claim 14, said maximization step comprising a variational mean shift maximization step.
 16. The device of claim 14, said maximization step comprising a variational medoid shift maximization step.
 17. The device of claim 14, said maximization step comprising a variational quick shift maximization step.
 18. A computer readable medium comprising computer executable instructions that perform a mode seeking method for generating clusters of data points within a given fixed data set, said method comprising: generating a hierarchical partition of kernels, each of said kernels having a fixed location at a data point in said fixed data set; creating a copy of said data set to create a set of dynamic data points; iteratively moving said dynamic data points to approximate kernel density modes by iteratively until convergence performing a first method comprising: generating an initial hierarchical partition of said dynamic data points; generating a block partition in the product space of fixed kernels and dynamic data points, a block in said block partition containing one or more kernels grouped together by said hierarchical partition of said fixed kernels and one or more dynamic data points grouped together by said hierarchical partition of said dynamic data points; representing said block partition in a marked partition tree; performing an expectation step to compute a variational distribution assigning the same variational membership probability to all fixed kernels within a block for all dynamic data points in same said block from said block partition; performing a variational maximization step that updates said dynamic data points towards said kernel density modes; at convergence, assigning data points to same cluster when corresponding dynamic data points have converged to a same mode.
 19. The computer readable medium of claim 18, said expectation step solving a constrained optimization problem comprising: traversing a marked partition tree representation for said block partition within said product space from a lowest level to a highest level and reducing nested constraints to a single equation at said highest level involving one Lagrange multiplier and solving said single equation to determine said Lagrange multiplier; traversing said marked partition tree representation from said highest level to said lowest level and resolving Lagrange multipliers for each of said partitions by back-substitution of a previously resolved Lagrange multiplier; and using said Lagrange multipliers to determine said variational distribution.
 20. The computer readable medium of claim 19, said method further comprising: the number of constraints in said constrained optimization being a number of lowest marked nodes in said marked partition tree. 